Add embedding-based detector #2

Open
wants to merge 2 commits into
base: main

@RobGeada RobGeada commented Jan 22, 2025

Adds a framework for defining detections based on a text-embedding classifier. The default configuration uses MMLU as the training data and builds a multi-label text classifier that infers which of the 61 MMLU subjects a given body of text belongs to. The detector endpoint accepts the following arguments:

  • contents: List of texts to classify.
  • allowList: Allowed subjects. Every inbound text must belong to at least one of these subjects to avoid triggering the detector.
  • blockList: Blocked subjects. No inbound text may belong to any of these subjects if it is to avoid triggering the detector.
  • threshold: The maximum distance a body of text can be from a subject centroid and still be classified into that subject. The default value is 0.75; a threshold above 10 classifies every document into every subject, so values in the range 0 < threshold < 1 are recommended.
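
The sketch below illustrates how these arguments could interact. It is not the code from this PR: the function names (`fit_centroids`, `subjects_for`, `is_flagged`), the use of Euclidean distance, and the toy 2-D vectors standing in for real text embeddings are all assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, List, Optional, Sequence, Set

Vector = Sequence[float]


def fit_centroids(labelled: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """Average the training embeddings of each subject into one centroid."""
    return {
        subject: [sum(column) / len(vectors) for column in zip(*vectors)]
        for subject, vectors in labelled.items()
    }


def subjects_for(
    embedding: Vector,
    centroids: Dict[str, List[float]],
    threshold: float = 0.75,
) -> Set[str]:
    """Multi-label step: keep every subject whose centroid is within `threshold`.

    Euclidean distance is an assumption; the real detector may use another metric.
    """
    return {
        subject
        for subject, centre in centroids.items()
        if math.dist(embedding, centre) <= threshold
    }


def is_flagged(
    subjects: Set[str],
    allow_list: Optional[Iterable[str]] = None,
    block_list: Optional[Iterable[str]] = None,
) -> bool:
    """Flag a text that matches no allowed subject or matches any blocked subject."""
    allowed = set(allow_list or [])
    blocked = set(block_list or [])
    if allowed and not (subjects & allowed):
        return True
    if blocked and (subjects & blocked):
        return True
    return False


# Toy 2-D vectors stand in for real text embeddings (hypothetical data).
centroids = fit_centroids(
    {
        "astronomy": [[0.0, 1.0], [0.2, 1.0]],
        "law": [[1.0, 0.0], [1.0, 0.2]],
    }
)

text_embedding = [0.1, 0.9]
matched = subjects_for(text_embedding, centroids)  # default threshold of 0.75
print(sorted(matched))                             # ['astronomy']
print(is_flagged(matched, allow_list=["law"]))     # True: no allowed subject matched
print(is_flagged(matched, block_list=["law"]))     # False: blocked subject not matched
```

With a very large threshold every centroid falls within range, which matches the note above that a threshold above 10 places every document in every subject.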

@@ -0,0 +1,37 @@
# Embedding Classification Detector

# Setup

Perhaps it would be useful to state that the local Python version must match the Python version in the Containerfile? At present, Python 3.9 is downloaded inside the container, which may warrant upgrading?

Can this be moved to a shared utils folder so that other detectors that require training can use this class?


sys.path.insert(0, os.path.abspath(""))
# from common.scheme import TextDetectionHttpRequest, TextDetectionResponse
import os

Delete the duplicate import.

3 participants